add option to use synthetic input data #1632
Conversation
Hi @alfuyao1986! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with the CLA signed label.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Please justify the value of this change, following https://github.com/pytorch/torchtitan/blob/main/CONTRIBUTING.md#proof-of-value
In particular, why is fake data better than the default `c4` / `c4_test`?
```python
def __iter__(self) -> Iterator[Tuple[Dict[str, torch.Tensor], torch.Tensor]]:
    while True:
        inputs = torch.randint(
```
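For context, a complete random-data iterator in this style might look roughly like the sketch below; the class name, the `"input"` key, and the `vocab_size`/`seq_len` defaults are illustrative assumptions, not the PR's actual code.

```python
from typing import Dict, Iterator, Tuple

import torch
from torch.utils.data import IterableDataset


class RandomTokenDataset(IterableDataset):
    """Endless stream of random token ids, useful for dependency-free benchmarking."""

    def __init__(self, vocab_size: int = 32000, seq_len: int = 2048) -> None:
        self.vocab_size = vocab_size
        self.seq_len = seq_len

    def __iter__(self) -> Iterator[Tuple[Dict[str, torch.Tensor], torch.Tensor]]:
        while True:
            # Draw seq_len + 1 random token ids, then split into input / shifted label.
            tokens = torch.randint(0, self.vocab_size, (self.seq_len + 1,))
            yield {"input": tokens[:-1]}, tokens[1:]
```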
This is fake data, not "synthetic" data.
Oh, how about calling it random data? The goal is to remove the dataset dependency for quick performance benchmarking.
The default c4 has two problems:
- Although it is small, the user still has to download it before testing, and with rapid debugging and reruns it is possible to hit the HF request limit. An unstable network also gets in the way of smooth development. I had to make local changes like this so I could develop without worrying about the dataset; I suspect many users have had a similar experience.
- With larger models and bigger batch sizes it will quickly loop back over the data, and for the same reason as above, downloading a very large dataset may be rate-limited or time-consuming for many users.
A random dataset is usually very useful when debugging the CPU overhead introduced by data loading, though I'm not sure we already have such a use case. Multimodal may benefit from a random dataset.
> Although it is small, the user still has to download it before testing, and with rapid debugging and reruns it is possible to hit the HF request limit. An unstable network also gets in the way of smooth development. I had to make local changes like this so I could develop without worrying about the dataset; I suspect many users have had a similar experience.
We have `c4_test` stored in the repo: https://github.com/pytorch/torchtitan/tree/main/tests/assets/c4_test
> With larger models and bigger batch sizes it will quickly loop back over the data, and for the same reason as above, downloading a very large dataset may be rate-limited or time-consuming for many users.
What would be the advantage of using random / fake data versus looping back on `c4_test`?
> Multimodal may benefit from a random dataset.
As we don't have multimodal training, the main thing I'd like to understand is what the benefit of adding random data is on top of the existing `c4_test`.
A random dataset can generally skip the overhead of data loading, such as actually reading from disk. This is not related to whether the dataset is large or small. But as mentioned above, this may become more useful once dataloader overhead starts to be a significant factor. As for development efficiency, I haven't encountered such an issue, so I'm not the one to answer that.
This is solely my opinion.
Oh, given that c4_test is already pre-stored in the repo, it should be fine for most cases. I am actually completely fine with using the pre-stored c4_test dataset. Just two more considerations to bring up for discussion:
- A random dataset can usually stress the whole stack better, numerically and computationally, than a small repeated dataset, but it is debatable whether this additional stress is practically realistic and necessary.
- Other frameworks (MaxText, Megatron-LM) do provide "synthetic/mock" data options for fast benchmarking, so from an ease-of-comparison point of view it may be better to have a matching option.
Got it, thanks for the context!
From my perspective, the value of this dataset is somewhat limited, given that we already have c4_test, which doesn't involve randomness and so has become a standard way to do numerical testing even when parallelism / world size changes.
That said, if people have strong opinions about adding this dataset, I'm OK with it, too. If that's the case, I would suggest making a new builder function & file, instead of piggybacking on the existing build_hf_dataloader. I understand that would make it harder to switch to this new dataset from the config, but that's not a good reason to reuse it.
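To illustrate how a separate builder could still stay switchable from the config, a thin dispatcher might look like the sketch below; the registry, the builder names, and the stub bodies are hypothetical and torchtitan's actual mechanism may differ.

```python
from typing import Callable, Dict


def build_hf_dataloader(**kwargs):
    # Stub standing in for the existing HuggingFace-backed builder.
    raise NotImplementedError


def build_random_dataloader(**kwargs):
    # Stub standing in for a new, standalone random-data builder in its own file.
    raise NotImplementedError


# Hypothetical mapping from the config's dataset name to a builder function.
DATALOADER_BUILDERS: Dict[str, Callable] = {
    "c4": build_hf_dataloader,
    "c4_test": build_hf_dataloader,
    "random": build_random_dataloader,
}


def build_dataloader(dataset_name: str, **kwargs):
    # A thin dispatch keeps the builders separate while the config switch stays easy.
    return DATALOADER_BUILDERS[dataset_name](**kwargs)
```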
Yes, we definitely shouldn't use build_hf_dataloader for a random dataset. There is actually another benefit of a random dataset (when it has a deterministic option): debugging checkpoint issues. Given that the dataloader is controlled by another package, having a random dataset with a deterministic option would make debugging checkpoint inconsistencies easier; at least we could rule out dataset/dataloader problems.
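A deterministic variant could seed a private generator and expose its state for checkpointing; the sketch below is only an illustration of that idea, with hypothetical names, not something this PR implements.

```python
import torch
from torch.utils.data import IterableDataset


class DeterministicRandomTokens(IterableDataset):
    """Random token stream that is reproducible across restarts via an explicit seed."""

    def __init__(self, vocab_size: int, seq_len: int, seed: int = 0) -> None:
        self.vocab_size, self.seq_len = vocab_size, seq_len
        self.generator = torch.Generator().manual_seed(seed)

    def __iter__(self):
        while True:
            tokens = torch.randint(
                0, self.vocab_size, (self.seq_len + 1,), generator=self.generator
            )
            yield {"input": tokens[:-1]}, tokens[1:]

    # Hypothetical hooks so the generator state can be saved/restored alongside a
    # checkpoint, letting one rule out the dataloader when chasing inconsistencies.
    def state_dict(self):
        return {"rng_state": self.generator.get_state()}

    def load_state_dict(self, state):
        self.generator.set_state(state["rng_state"])
```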
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
No description provided.